13 research outputs found

    Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

    Full text link
    Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual vectors. As the outer product is typically infeasible due to its high dimensionality, we instead propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and expressively combine multimodal features. We extensively evaluate MCB on the visual question answering and grounding tasks. We consistently show the benefit of MCB over ablations without MCB. For visual question answering, we present an architecture which uses MCB twice, once for predicting attention over spatial features and again to combine the attended representation with the question representation. This model outperforms the state-of-the-art on the Visual7W dataset and the VQA challenge.
    Comment: Accepted to EMNLP 201
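
    To make the pooling step concrete, below is a minimal NumPy sketch of compact bilinear pooling in the spirit this abstract describes: each feature vector is Count-Sketched into a common dimension and the two sketches are combined by circular convolution (computed via the FFT), which approximates the outer product. The output dimension, random seed, and feature sizes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Count Sketch of x: add s[i] * x[i] into output slot h[i]."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)
    return y

def mcb_pool(v, q, d=16000, seed=0):
    """Sketch of compact bilinear pooling of a visual vector v and a question vector q.

    Approximates the outer product of v and q by count-sketching each vector and
    combining the two sketches with circular convolution via the FFT. The projection
    dimension d and the random hashes/signs are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    h_v = rng.integers(0, d, size=v.shape[0])    # random hash indices for v
    h_q = rng.integers(0, d, size=q.shape[0])    # random hash indices for q
    s_v = rng.choice([-1, 1], size=v.shape[0])   # random signs for v
    s_q = rng.choice([-1, 1], size=q.shape[0])   # random signs for q

    sketch_v = count_sketch(v, h_v, s_v, d)
    sketch_q = count_sketch(q, h_q, s_q, d)

    # Circular convolution of the sketches equals the sketch of the outer product.
    return np.fft.irfft(np.fft.rfft(sketch_v) * np.fft.rfft(sketch_q), n=d)

# Hypothetical usage: combine a 2048-d image feature with a 1024-d question embedding.
v = np.random.randn(2048)
q = np.random.randn(1024)
print(mcb_pool(v, q).shape)  # (16000,)
```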

    Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)

    Full text link
    Deep models are the de facto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annotations is one of the bottlenecks of generating multimodal explanations. Thus, we propose two large-scale datasets with annotations that visually and textually justify a classification decision for various activities, i.e. ACT-X, and for question answering, i.e. VQA-X. We also introduce a multimodal methodology for generating visual and textual explanations simultaneously. We quantitatively show that training with the textual explanations not only yields better textual justification models, but also models that better localize the evidence that supports their decision.
    Comment: arXiv admin note: text overlap with arXiv:1612.0475

    Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

    Full text link
    Deep models that are both effective and explainable are desirable in many settings; prior explainable models have been unimodal, offering either image-based visualization of attention weights or text-based generation of post-hoc justifications. We propose a multimodal approach to explanation, and argue that the two modalities provide complementary explanatory strengths. We collect two new datasets to define and evaluate this task, and propose a novel model which can provide joint textual rationale generation and attention visualization. Our datasets define visual and textual justifications of a classification decision for activity recognition tasks (ACT-X) and for visual question answering tasks (VQA-X). We quantitatively show that training with the textual explanations not only yields better textual justification models, but also better localizes the evidence that supports the decision. We also qualitatively show cases where visual explanation is more insightful than textual explanation, and vice versa, supporting our thesis that multimodal explanation models offer significant benefits over unimodal approaches.
    Comment: arXiv admin note: text overlap with arXiv:1612.0475
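
    As a rough illustration of the kind of joint model described above (spatial attention over image regions for "pointing" plus a generated textual justification), here is a minimal PyTorch-style sketch. The layer sizes, vocabulary size, and single-LSTM decoder are assumptions made for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class MultimodalExplainer(nn.Module):
    """Minimal sketch: answer a question while producing (a) attention weights
    over image regions (visual explanation) and (b) a textual justification
    decoded with teacher forcing. Dimensions are illustrative assumptions.
    """
    def __init__(self, img_dim=2048, q_dim=512, hid=512, n_answers=1000, vocab=5000):
        super().__init__()
        self.att = nn.Linear(img_dim + q_dim, 1)        # attention score per region
        self.answer = nn.Linear(img_dim + q_dim, n_answers)
        self.embed = nn.Embedding(vocab, hid)
        self.proj = nn.Linear(img_dim + q_dim + n_answers, hid)
        self.decoder = nn.LSTM(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, regions, q, expl_tokens):
        # regions: (B, R, img_dim), q: (B, q_dim), expl_tokens: (B, T)
        q_tiled = q.unsqueeze(1).expand(-1, regions.size(1), -1)
        scores = self.att(torch.cat([regions, q_tiled], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)            # visual explanation
        attended = (alpha.unsqueeze(-1) * regions).sum(dim=1)
        fused = torch.cat([attended, q], dim=-1)
        ans_logits = self.answer(fused)                  # task prediction
        # Condition the textual justification on the fused features and the answer.
        ctx = self.proj(torch.cat([fused, ans_logits], dim=-1))
        emb = self.embed(expl_tokens) + ctx.unsqueeze(1)
        dec, _ = self.decoder(emb)
        expl_logits = self.out(dec)                      # textual explanation
        return ans_logits, alpha, expl_logits

# Hypothetical usage with 36 region features per image and 12 explanation tokens.
model = MultimodalExplainer()
ans, alpha, expl = model(torch.randn(2, 36, 2048), torch.randn(2, 512),
                         torch.randint(0, 5000, (2, 12)))
```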

    Vision and Language Understanding Through Generative Modeling

    No full text

    Statistical Analysis of Low-latitude Pi2 Pulsations Observed at Bohyun Station in Korea

    No full text
    We statistically investigated the properties of low-latitude Pi2 pulsations using Bohyun (BOH, Mlat = 29.8°, L = 1.35) ground magnetometer data from 2008. For this 1-year interval, 582 Pi2 events were identified while BOH was on the nightside, between 1800 and 0600 local time. We found the following Pi2 characteristics. (1) The occurrence distribution of Pi2s is relatively constant across local time. (2) The Pi2 frequency varies with local time: Pi2 pulsations in the postmidnight sector have higher frequencies than those in the premidnight sector. (3) Pi2 power in the premidnight sector is stronger than in the postmidnight sector. (4) Pi2 frequency is positively correlated with solar wind speed and the AE index. (5) Pi2 power shows no clear correlation with solar wind parameters, indicating that Pi2 power is not controlled by external sources. (6) The most probable time between Pi2 onsets is Δt ~ 37.5 min, which we interpret as the period between Pi2 pulsations when they occur cyclically. We suggest that Δt ~ 37.5 min corresponds to the recurrence of reconnection of open field lines in the tail lobe.
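
    As a small illustration of item (6), the inter-onset statistic can be obtained by histogramming the waiting times between consecutive Pi2 onsets, as in the NumPy sketch below. The onset list, bin width, and range are hypothetical choices, not the study's exact processing.

```python
import numpy as np

def inter_onset_histogram(onset_times_min, bin_width=7.5, max_dt=240):
    """Histogram of waiting times between consecutive Pi2 onsets.

    onset_times_min: onset times in minutes (e.g. one night's event list).
    Returns the bin counts, bin edges, and the centre of the modal bin,
    which is the "most probable time between onsets" statistic.
    Bin width and range are illustrative assumptions.
    """
    dts = np.diff(np.sort(np.asarray(onset_times_min, dtype=float)))
    bins = np.arange(0, max_dt + bin_width, bin_width)
    counts, edges = np.histogram(dts, bins=bins)
    peak = edges[np.argmax(counts)] + bin_width / 2   # centre of the modal bin
    return counts, edges, peak

# Hypothetical example: onsets at 10, 45, 85, 120, and 160 minutes after 1800 LT.
counts, edges, peak = inter_onset_histogram([10, 45, 85, 120, 160])
print(peak)
```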

    Flavones: An important scaffold for medicinal chemistry

    No full text